Benchmark for Models Predicting Human Behavior
in Gap Acceptance Scenarios
Abstract—Autonomous vehicles currently suffer from a
time-inefficient driving style caused by uncertainty about human
behavior in traffic interactions. Accurate and reliable prediction 
models enabling more efficient trajectory planning could make 
autonomous vehicles more assertive in such interactions. How- 
ever, the evaluation of such models is commonly oversimplistic, 
ignoring the asymmetric importance of prediction errors and 
the heterogeneity of the datasets used for testing. We examine 
the potential of recasting interactions between vehicles as gap 
acceptance scenarios and evaluating models in this structured 
environment. To that end, we develop a framework facilitating the 
evaluation of any model, by any metric, and in any scenario. We 
then apply this framework to state-of-the-art prediction models, 
which all show themselves to be unreliable in the most safety- 
critical situations. 


Index Terms—autonomous vehicles, gap acceptance, behavior 
prediction, benchmark. 


I. INTRODUCTION 


SUCCESSFULLY implementing autonomous driving is
one of the key technical challenges faced by the automo- 
tive industry as well as large parts of the research community, 
with tens of billions of dollars invested in recent years towards 
this goal [1]. The provision of those funds is motivated by 
several benefits promised by this technology. The foremost of 
these is safer driving, expressed by a significant decrease in 
accidents and, correspondingly, a reduction of bodily harm and 
financial losses. Additional advantages are also expected, such 
as more accessible mobility for people unable to drive or an 
easing of road congestion and traffic [2]-[4]. 

But despite all these investments, autonomous vehicles still 
suffer from many problems preventing widespread use [5], 
[6]. One such problem is their timidity in interactions with 
human traffic participants, caused by the uncertainty about 
the future behavior of those human agents. This uncertainty 
can prevent the autonomous vehicle from taking the most 
time-efficient actions if the resulting probability of a crash or 
near-crash is too high, resulting in the cautious driving style 
observed. Paradoxically, this can also be a safety risk, as such 
caution by an autonomous vehicle is often not expected by 
the surrounding humans, which can result in accidents such 
as being rear-ended [5], [7]. 

To reduce this uncertainty and to allow for a more efficient 
driving style without compromising on safety requirements, 
[Footnote: The source code, trained models, and data can be found online in a
public GitHub repository.]


[Fig. 1 content: the gap acceptance benchmark connects Datasets, Models (binary: logistic regression, random forest, deep belief network, meta-heuristic model; trajectory: Trajectron++, AgentFormer), and Metrics (binary: accuracy, area under curve, true negative rate under perfect recall; trajectory: average displacement error).]


Fig. 1. The proposed framework allows researchers to evaluate the perfor- 
mance of any prediction model for human behavior according to any metric 
on any dataset including gap acceptance scenarios. 


behavior prediction models can be used [8], [9], which project
the future position of traffic participants. Those can range from
models able to deal with any kind of traffic participant [10]-[12]
to others focused on predicting the behavior of
a specific kind of participant, such as cars [13]-[19] or
pedestrians [20]-[24].

However, the utility of those models—primarily designed 
to minimize the necessary trade-off between safety and effi- 
ciency in trajectory planning—is questionable, as the common 
methods for their evaluation diverge from the models’ purpose. 
First, most common metrics for evaluating prediction models, 
such as the final or average displacement error, ignore that the 
consequences of a false prediction are inherently asymmetric 
[25], [26]. For example, on a highway, wrong longitudinal pre- 
dictions are far less dangerous than wrong lateral predictions, 
which might result in an autonomous vehicle reacting to a lane 
change too late. Second, the common approach of randomly 
selecting test cases from datasets [10], [12] is problematic due 
to the heterogeneity of those datasets, which typically include 
samples that can vary widely in their importance and difficulty. 
Such samples can range from a single vehicle following a
lane to complex space-sharing conflicts with multiple agents
at unsignalized intersections, where the behavior of human 
agents is often multi-modal and can change rapidly. Rare edge 
cases, where some traffic participants are very aggressive or 
even violate traffic rules and accidents are far more likely [27], 
[28], are also possible. But with randomly selected test cases, 
potentially poor performance in the most important situations 
can be compensated by good performance in less important 
but more numerous ones. For these reasons, a useless model 
might appear promising, which hampers further progress. 

One possible approach to overcome these issues is including
a path planning algorithm in evaluations, as suggested
by Ivanovic and Pavone [25]. However, this adds further
computational load to an evaluation and only addresses the
symmetry of common metrics, neglecting the varying difficulty
and importance of testing samples. To cover both these
problems, we suggest instead narrowing the evaluation to the
most critical situations. In particular, we focus the evaluation
of behavior prediction models on gap acceptance scenarios,
a concept that encompasses most of the safety-critical
interactions between autonomous vehicles and humans [29]. In a
gap acceptance scenario, an autonomous vehicle follows a 
particular trajectory over which a second traffic participant 
(e.g., a pedestrian or another vehicle) can move either in 
front of or behind the autonomous vehicle. Here, the first 
option (i.e., the human accepting the gap) would require the 
autonomous vehicle to potentially alter its trajectory planning, 
while the latter one of rejecting the gap would not. Due to 
the narrow focus, estimating the importance and difficulty of 
a particular situation can become much more straightforward. 
Additionally, as the human has only two options to decide 
between, such gap acceptance scenarios allow the usage of 
simple binary prediction models to estimate if the human 
behavior requires an adjustment of trajectory planning. 

Many binary prediction models have been developed for 
gap acceptance scenarios. However, those mostly focus and 
are trained on a specific scenario, such as the street crossing 
behavior of pedestrians [30]-[32], the crossing behavior of 
cars at intersections [33]-[37], or lane change decisions on 
highways [38]-[41]. Additionally, the development of those
models still suffers from similar problems as the trajectory 
prediction models, such as the symmetry of common metrics
like accuracy. A random selection of test cases [31], [35] and 
neglect of the varying importance of different samples are 
also common. Additionally, in contrast to trajectory prediction 
models, which are commonly compared to each other on 
accepted benchmarks (such as on the ETH dataset [10]-[12] 
when predicting pedestrian crowds), an equivalent benchmark 
does not exist for binary prediction models [42]. Instead, those 
models are mostly trained and tested on datasets exclusive to 
the respective work and are—if at all—only compared against 
a small number of other selected models [30], [31], [35]-[37], 
[43]-[47]. 

Our goal in this work is to overcome these limitations of 
the current literature on both binary and trajectory prediction 
models and enable a meaningful evaluation of these models 
in gap acceptance scenarios. Such an evaluation can not only
make the development of trajectory prediction models more
goal-oriented, but also help determine to what extent the
inclusion of specialized binary prediction models can improve 
the performance and reliability of general trajectory prediction 
models. To that end, this paper makes three main contributions: 


• We develop a formal description of the gap acceptance
process that applies to all possible gap acceptance scenar- 
ios. This description includes a detailed timeline of gap 
acceptance (Section II), which serves as a foundation for 
methods to estimate the criticality concerning the safety 
of each sample, which is a fundamental requirement for 
selecting meaningful test cases. 

• We devise a framework for evaluating behavior prediction
models in gap acceptance scenarios. This framework
allows the integration of varied gap acceptance datasets,
models, and evaluation metrics (Section III). It is inspired
by similar works by Müller et al. for computer-based
image retrieval algorithms [48], by Zaffar et al. in the
field of visual place recognition [49], and by Cao et al. for
evaluating the robustness of trajectory prediction models
to adversarial attacks [50]. This approach allows one
to test any chosen model and compare it to other models
in any possible environment more easily. Simultaneously,
the framework allows precise control over splitting
the data into training and testing samples to evaluate the
models’ reliability in the most difficult gap acceptance
situations.

• Using the proposed framework, we compare several
prediction models in their performance on three gap 
acceptance datasets, with a focus on their performance 
in safety-critical edge cases (Section IV and Figure 1). 
Consequently, other researchers can easily access the al- 
ready implemented models to compare their own models 
against. Additionally, we compare models dedicated to 
predicting binary gap acceptance decisions to state-of- 
the-art trajectory prediction models to test the hypothesis 
that including dedicated binary models for gap accep- 
tance problems could improve such trajectory prediction 
models. 


II. DEFINING GAP ACCEPTANCE 


To estimate the difficulty and control the importance of
prediction tasks across disparate datasets, a coherent formal
definition of gap acceptance scenarios is needed. Here we 
propose such a definition. 

In a gap acceptance scenario, an autonomous vehicle $V_E$
(also referred to as the ego vehicle) plans to follow a
certain trajectory $P_E$ along which it has the right of way.
This trajectory overlaps with the trajectory $P_T$ of another,
human-controlled vehicle $V_T$ (also named target vehicle).
Such an overlap might, for example, happen at unsignalized
intersections, where the agents move along crossing streets, or
on highways, where $V_T$ wants to merge into the faster lane
along which $V_E$ is driving. In such situations, $V_T$ can decide to
move onto $P_E$ either in front of or behind $V_E$, i.e., to accept or
reject the gap offered by $V_E$. We assume that $V_E$ has the right
of way along $P_E$, as otherwise, traffic rules would obligate it
to preemptively yield.


[Fig. 2, panel A: The human agent $V_T$ accepts the gap offered by the autonomous vehicle $V_E$ safely ($t_S < t_A < t_{crit}$).]
Fig. 2. The characteristic time points of the gap acceptance process (defined by the relation of the agents to the contested space, shown in purple) for two different
examples of gap acceptance, intersection crossing (upper panels) and lane changing (lower panels). In both examples, the autonomous vehicle $V_E$ (in red)
offers a gap to the human-driven vehicle $V_T$ (in blue). In total, three cases are possible (A to C), depending on $t_A$, i.e., the time the target vehicle enters the
contested space. In B, the accepted gap decision by the human is considered to be unsafe, as $V_E$ cannot guarantee the avoidance of a crash, potentially
not having enough time for braking. Meanwhile, in C, it might be possible that $V_T$ crashes into $V_E$.


Under these conditions, a gap acceptance scenario is characterized
by the spatio-temporal relation of the agents
to the so-called contested space [29]. There, the trajectories
$P_E$ and $P_T$ would start to overlap, making this the
location of a potential collision. An example is the overlap
of two crossing lanes at an intersection. However, in specific
scenarios (such as changing lanes on highways), the exact
location of the meeting point of $P_E$ and $P_T$ can be at the
discretion of the human agent $V_T$ and therefore be unknown
before the actual decision. In such cases, we place the
contested space under the assumption that $V_T$ would decide
to accept the gap immediately. For example, in the scenario
of highway lane changes, the contested space would therefore
move in parallel to $V_T$, only stopping once $V_T$ starts
to enter the lane of $V_E$.

The following time points then characterize the gap acceptance
process (illustrated together with the contested space in
Figure 2):

$t_S$: At the starting time $t_S$, there is no longer any other
vehicle along $P_E$ in between $V_E$ and the contested
space. This is primarily the case when the vehicle
preceding $V_E$ leaves the contested space, but other
options are imaginable, like the vehicle in front of $V_E$
leaving $P_E$.

$t_C$: At $t_C$, $V_E$ starts to enter the contested space, closing
the gap.

$\hat{t}_C(t)$: A prediction of $t_C$ by the ego vehicle, made at $t$, needed
to allow gap size estimations during online applications.
While this is scenario-dependent, the following condition
has to be satisfied so that an open gap can still be
characterized as such, even if $V_E$ is moving away from
the contested space:

$$\operatorname{sgn}\left(\hat{t}_C(t) - t\right) = \operatorname{sgn}\left(t_C - t\right).$$

$t_{crit}$: The last time $V_E$ can safely prevent a collision even
in the case of malicious behavior by $V_T$; e.g., at this
point, a safe braking process could bring $V_E$ to a stop
before the intersection. $t_{crit}$ can be formalized in the
following condition:

$$\Delta t_D(t) = \hat{t}_C(t) - t - t_{brake}(t) = 0 \tag{1}$$

Here, the required braking time $t_{brake}$ is not based on
the maximum deceleration $V_E$ is technically capable of,
but instead on one that is considered safe. The time point
$t_{crit}$ is also the last time a prediction can be considered
useful for further trajectory planning.

$t_A$: At $t_A$, $V_T$ enters the contested space, potentially
accepting the gap.

We count $V_T$ as rejecting the gap if $V_E$ is allowed to move
first onto the contested space, i.e., if $t_C < t_A$. If this is not
the case and the human moves first ($t_A < t_C$), the gap is
considered accepted.
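The decision rule above, together with the three cases of Figure 2, can be written as a small classification function. The following Python sketch is only an illustration: the name `classify_gap` and the string labels are hypothetical, all four time points are assumed to be known in hindsight, and the handling of exact ties is an assumption, since the text leaves it open.

```python
def classify_gap(t_S: float, t_A: float, t_C: float, t_crit: float) -> str:
    """Label one gap acceptance sample from its characteristic time points.

    Hypothetical helper, not part of the published framework.
    t_S is listed for completeness; the gap is assumed to open before t_A.
    """
    if t_C < t_A:
        # The ego vehicle entered the contested space first: gap rejected (case C).
        return "rejected"
    if t_A < t_crit:
        # Accepted while the ego vehicle could still brake safely (case A).
        return "accepted_safe"
    # Accepted after the last safe braking point (case B).
    # Exact ties (t_A == t_C or t_A == t_crit) also end up here or above by
    # assumption; the source does not define them.
    return "accepted_unsafe"
```

For example, `classify_gap(0.0, 4.0, 5.0, 3.0)` describes a target vehicle that enters the contested space one second before the ego vehicle but after the last safe braking point, which corresponds to case B.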


III. FRAMEWORK FOR BENCHMARKING GAP ACCEPTANCE
MODELS


After defining the fundamental characteristics of a gap 
acceptance scenario, we will use this groundwork to build 
a framework for benchmarking gap acceptance models. This 
framework should allow for the performance assessment of
any model $M$ on any dataset $D$ according to any evaluation
metric $E$. The following requirements need to be met for such
an assessment to be possible and meaningful:


R1 The time point $t_0$ of a prediction must be controllable,
as it influences not only the difficulty of the prediction
but also its importance due to changing consequences
of a false prediction.

R2 To evaluate models in critical situations, the framework
should allow control over splitting all available samples
into training and testing sets.

R3 Models producing (as well as metrics evaluating) binary
or trajectory predictions should fit into the framework.
Therefore, the framework should allow transformations
between those forms of model output.

Fig. 3. Functionalities of the proposed framework. To evaluate a model $M$ on a dataset $D$ with the metric $E$ and splitting method $S$, a method for determining
the prediction time $t_0$ has to be chosen first (Section III-A). This method is used to extract input and output trajectories ($D_I$ and $D_O$, respectively) from
the dataset $D$ (III-B). Those samples are split into training and testing sets using the splitting method $S$ (III-C), with the training set ($D_{I,train}$, $D_{O,train}$) used
to train a model $M$ (III-D). Subsequently, predictions $D_{OP,test}$ are made for the test samples $D_{I,test}$ with the trained model (III-E). It might be necessary
to transform these predictions into another form (III-F) before the metric $E$ compares them to the true outputs $D_{O,test}$ (III-G). These steps produce two
outcomes (red diamonds): the similarity $\zeta$ between training and test data, provided by $S$, and the model performance $F$ according to metric $E$.

Considering these requirements, seven functionalities—each 
assigned to one of the four modules from Figure 3—will 
constitute the proposed framework. Those are presented in the 
order in which they are employed in the process of a single 
evaluation. 


A. Setting the prediction time — Metric E 


To satisfy requirement R1, this functionality enables the
selection of the time point $t_0$ at which the prediction has to be
made. As the prediction time influences both the importance of 
such predictions and the meaningfulness of different metrics 
(Appendix A-A), this functionality is attached to the metric 
module. 

Currently, three methods are implemented in the framework
to determine $t_0$:

• Prediction at the initial opening of the gap: $t_0 = t_S$. The
prediction is made when the gap first appears; this is
the baseline most commonly used in the literature [35],
[39], [44].

• Prediction at gaps with fixed size: $t_0 = \min\{t \mid \hat{t}_C(t) - t = \Delta t\}$.
The prediction is made when the gap offered
has a uniform duration $\Delta t$, which should make every
prediction equally difficult due to a similar prediction horizon.

• Last useful prediction for critical gaps: $t_0 = t_{crit} - t_\epsilon$.
The prediction is made at the last point in time when it
would still be useful, with $t_\epsilon$ being used to allow time
for calculations.

Here, $t_0$ has to be calculated without hindsight knowledge for
online predictions at a time when $t_C$ or $t_A$ are not known.
A discussion on the impact of the different approaches on the
resulting datasets can be found in Appendix A-B.
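The three options for the prediction time can be sketched on sampled data as follows. This is a minimal illustration, not the framework's implementation: the function name, argument names, and method labels are assumptions. Because sampled data rarely meets the fixed-size condition with exact equality, the sketch takes the first time step at which the predicted gap has shrunk to the requested size or below.

```python
import numpy as np

def prediction_time(method, t, tC_hat, t_S, t_crit=None, dt_gap=None, t_eps=0.0):
    """Return the prediction time t_0 for one sample, or None if none exists.

    t       : increasing array of time stamps
    tC_hat  : predicted closing time of the gap, estimated at each entry of t
    All names are illustrative assumptions.
    """
    t = np.asarray(t, dtype=float)
    tC_hat = np.asarray(tC_hat, dtype=float)
    if method == "initial":
        # Prediction at the initial opening of the gap.
        return float(t_S)
    if method == "fixed_size":
        gap = tC_hat - t  # predicted remaining gap size at each time step
        hits = np.flatnonzero(gap <= dt_gap)
        # First time step at which the gap is no larger than dt_gap.
        return float(t[hits[0]]) if hits.size else None
    if method == "critical":
        # Last useful prediction, leaving t_eps for calculations.
        return float(t_crit - t_eps)
    raise ValueError(f"unknown method: {method}")
```

A sample whose gap never shrinks to the requested size yields `None` and would simply be dropped; whether the returned time is admissible at all is decided by the sample filter applied during data extraction.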


B. Extracting input and output — Dataset D 


Next, the input and output data for each sample are extracted
from a given trajectory $X_T$, which includes positions at different
time points $T$ from different actors $V = \{V_E, V_T, V_1, \ldots\}$:

$$X_T = \{\boldsymbol{x}(t) \mid t \in T\}$$
$$\boldsymbol{x}(t) = \{\boldsymbol{x}_i(t) \mid V_i \in V\}$$
$$\boldsymbol{x}_i(t) = (x_i(t), y_i(t)) \in \mathbb{R}^2$$

This functionality, requiring access to the raw data from the
scenario and thus being part of the dataset module, consists
of eight consecutive steps:

• The characteristic time points $t_S$, $t_A$, and $t_C$ are extracted
from the given trajectories $X_T$. Furthermore, $\hat{t}_C$ and, based
on this, $t_{brake}$ are estimated at every time point in $T$. This
extraction is scenario-specific, while the following
steps can be applied generally.

• The binary decision $a$, with $a = 1$ for accepted gaps
($t_A < t_C$) and $a = 0$ for rejected gaps, is extracted.

• $t_{crit}$ is extracted next, with

$$t_{crit} = \begin{cases} t_S & \Delta t_D(t_S) < 0 \\ t_A + t_\epsilon & \min\{\Delta t_D(t) \mid t_S < t < t_A\} > 0 \\ t_D & \text{else} \end{cases}$$

where

$$t_D = \min\{t \mid t > t_S \wedge \Delta t_D(t) = 0\}$$

satisfies both requirements in Equation (1).

• The time of prediction $t_0$ is calculated according to the
method chosen previously. Only samples that meet the
condition

$$t_S \le t_0 < \min\{t_A, t_{crit}\} \tag{2}$$

are included in the final dataset, to ensure that gaps are
already offered, that $V_T$ has not made a decision yet, and that
the prediction is still useful.

• The number of input time steps $n_I$ and the time-step size
$\delta t$ are chosen. One also has to determine the number of
output time steps $n_O$, setting the prediction horizon as
$n_O \delta t$. Here, $n_O$ is set so that it can always be determined
if the gap is closed or accepted.

• Based on $t_0$, $n_I$, $n_O$, and $\delta t$, the time steps for input and
output data are selected, named $T_I$ and $T_O$ respectively:

$$T_I = \{t_0 + i \delta t \mid i \in \{-n_I + 1, \ldots, 0\}\}$$
$$T_O = \{t_0 + i \delta t \mid i \in \{1, \ldots, n_O\}\}$$

• For those time steps, the input trajectories $X_{T_I}$ and
output trajectories $X_{T_O}$ are extracted from $X_T$, using
interpolation if necessary. For the output, only the trajectory
of the target agent $V_T$ is required, as its behavior is
to be predicted.

• Certain domain information $k$ is collected, for instance,
the location at which the trajectories $X_T$ were collected
or the test subjects involved in gathering the data.
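The case distinction for the critical time and the sample filter of Equation (2) can be illustrated on sampled data. The Python sketch below is hedged in several ways: the function names are hypothetical, the zero crossing of the braking margin is approximated by the first sampled time at which the margin is no longer positive, and the lower bound of the filter is taken as inclusive so that samples with a prediction at the opening of the gap are not discarded.

```python
import numpy as np

def critical_time(t, dt_D, t_S, t_A, t_eps):
    """Evaluate the three-case definition of t_crit on sampled data.

    t    : increasing array of time stamps
    dt_D : braking margin at each time stamp, i.e. the predicted closing
           time minus the current time minus the safe braking time
    """
    t = np.asarray(t, dtype=float)
    dt_D = np.asarray(dt_D, dtype=float)
    if np.interp(t_S, t, dt_D) < 0:
        # Safe braking is already impossible when the gap opens.
        return float(t_S)
    window = (t > t_S) & (t < t_A)
    if dt_D[window].min(initial=np.inf) > 0:
        # The margin stays positive until the gap is accepted.
        return float(t_A + t_eps)
    # Discrete stand-in for the first root of the margin after t_S.
    idx = np.flatnonzero((t > t_S) & (dt_D <= 0))
    return float(t[idx[0]])

def keep_sample(t_0, t_S, t_A, t_crit):
    """Sample filter of Equation (2), with an inclusive lower bound (assumed)."""
    return t_S <= t_0 < min(t_A, t_crit)
```

With this filter, a sample whose prediction time coincides with the moment the target vehicle enters the contested space is rejected, since the decision would already be visible at that point.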

The input data $D_I$ then includes from each sample the
input trajectory $X_{T_I}$ and the corresponding time steps $T_I$.
Meanwhile, the output data $D_O$ takes the output trajectory
$X_{T_O}$ and the corresponding time steps $T_O$, as well as the
binary decision $a$, the time of accepting the gap $t_A$, and the
domain information $k$ from each sample. More mathematical
details are in Appendix A-C.
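The selection of input and output time steps and the subsequent interpolation can be sketched as follows. This is an illustrative Python example under stated assumptions: the function name `extract_io` is hypothetical, only one agent is handled, and linear interpolation is used, whereas the text only states that interpolation is applied if necessary.

```python
import numpy as np

def extract_io(t_raw, xy_raw, t_0, n_I, n_O, dt):
    """Resample one agent's recorded positions onto input and output steps.

    t_raw  : increasing array of recorded time stamps, shape (N,)
    xy_raw : recorded positions, shape (N, 2)
    """
    T_I = t_0 + dt * np.arange(-n_I + 1, 1)   # i = -n_I + 1, ..., 0
    T_O = t_0 + dt * np.arange(1, n_O + 1)    # i = 1, ..., n_O

    def resample(T):
        # Linear interpolation, applied separately to x and y. Note that
        # np.interp holds the boundary value outside the recorded range.
        return np.stack(
            [np.interp(T, t_raw, xy_raw[:, k]) for k in range(2)], axis=1
        )

    return T_I, resample(T_I), T_O, resample(T_O)
```

The last input step coincides with the prediction time, and the first output step lies one time step after it, so input and output never share a time stamp.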


C. Creating training and testing set — Splitting method S 


To fulfill requirement R2, the splitting method $S$, separating
the given samples into training and testing sets, is a crucial part
of the framework. As this functionality should be independent
of the scenario, it is part of the separate splitting module.
Examples range from random splitting to methods
taking into account all the information in $D_I$ and $D_O$.

Besides the potential similarity measure $\zeta$ of the testing
samples to the training set, this functionality creates the training
data $D_{I,train}$ and $D_{O,train}$ as well as the test data $D_{I,test}$ and
$D_{O,test}$.


D. Training the model on the training set — Model M 


After splitting the samples into training and testing sets, the 
model has to be trained, which is one of the functionalities of 
the model module that has to be individually implemented 
for each model. If the model, for instance, requires input
velocities, extracting those from the given position data $X_{T_I}$
and $X_{T_O}$ is done here.
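As an example of such model-specific preprocessing, velocities can be approximated from equally spaced positions with finite differences. The sketch below uses `numpy.gradient`; the function name is hypothetical, and the choice of central differences is an assumption rather than the method used by any of the benchmarked models.

```python
import numpy as np

def velocities_from_positions(X, dt):
    """Approximate velocities from positions sampled every dt seconds.

    X : positions of one agent, shape (n_steps, 2)
    """
    X = np.asarray(X, dtype=float)
    # Central differences in the interior, one-sided differences at the ends.
    return np.gradient(X, dt, axis=0)
```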


E. Making the predictions for the testing set — Model M 


For every sample from the input testing set $D_{I,test}$, a
prediction $d_{pred}$ is made by the trained model, with all $d_{pred}$
constituting the set of predictions $D_{OP,test}$. As such predictions
rely on a trained model, this functionality is also part of
the model module. Depending on this model, each prediction
$d_{pred}$ might take different forms. Three different forms of
stochastic predictions are implemented in the framework.

• Binary prediction: $d_{pred} = a_{pred}$, i.e., only the probability
$a_{pred} \in [0, 1]$ of $V_T$ accepting the offered gap is predicted.

• Timing prediction: $d_{pred} = \{a_{pred}, t_{A,pred}\}$, i.e., not only
$a_{pred}$ is predicted but also the time $t_{A,pred}$ at which the
gap might be accepted.

• Trajectory prediction: $d_{pred} = X_{T_O,pred}$, i.e., the full
trajectory of $V_T$ is predicted. This prediction consists of
$n_p$ trajectories $X_{T_O,p}$ (all equally likely) to represent
probabilistic outputs.

More mathematical details are in Appendix A-D.


F. Transforming the predictions — Dataset D 


To fulfill requirement R3, we then need to be able to
transform a prediction $d_{pred}$ into any other form we might
require. This functionality is part of the dataset module, as
a specific dataset entails the context information required to, for example,
classify different trajectories.

These transformations rely on three different functions $T_i$:

$T_1$: takes the trajectory prediction $X_{T_O,pred}$ and then
provides $\{a_{pred}, t_{A,pred}\}$. This is similar to the extraction
of the time points in Section III-B.

$T_2$: takes the prediction $\{a_{pred}, t_{A,pred}\}$ and then provides
the trajectory prediction $X_{T_O,pred}$, consisting of $n_p$
trajectories from the predictions of two conditional
trajectory prediction models trained only on accepted
and rejected gaps, respectively. These are selected so
that $T_1(X_{T_O,pred})$ results in the original inputs.

$T_3$: takes a binary prediction $d_{pred} = a_{pred}$ and provides the
predicted time of accepting the gap $t_{A,pred}$, by extracting
it from the prediction of a trajectory prediction model
trained only on accepted gaps.

These three functions (exact implementation in Appendix A-E)
are enough to enable any transformation between prediction
forms, as seen in Table I.
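To illustrate the first of these transformations, the following Python sketch reduces a set of equally likely predicted trajectories to an acceptance probability and a set of predicted acceptance times. It is not the implementation from Appendix A-E: it assumes that each predicted trajectory has already been converted into the remaining distance of the target vehicle to the contested space (a scenario-specific step), and that the closing time of the gap is given.

```python
import numpy as np

def T1(dist_to_cs, T_O, t_C):
    """Trajectory prediction -> (a_pred, t_A_pred), as a hedged sketch.

    dist_to_cs : shape (n_p, n_O); remaining distance of the target vehicle
                 to the contested space for each predicted trajectory
    T_O        : output time steps, shape (n_O,)
    t_C        : time at which the ego vehicle closes the gap (assumed known)
    """
    dist_to_cs = np.asarray(dist_to_cs, dtype=float)
    T_O = np.asarray(T_O, dtype=float)
    entered = dist_to_cs <= 0.0
    # First output step at which each trajectory is inside the contested
    # space; trajectories that never enter get an entry time of infinity.
    first = entered.argmax(axis=1)
    t_A = np.where(entered.any(axis=1), T_O[first], np.inf)
    accepted = t_A < t_C
    # All trajectories are equally likely, so the acceptance probability
    # is the share of trajectories that enter before the gap closes.
    return float(accepted.mean()), t_A[accepted]
```

Because the trajectories are equally likely, no weighting is needed; a model that outputs weighted samples would require a weighted mean instead.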


G. Evaluating the predictions — Metric E 


This functionality, the main part of the metric module,
implements the performance evaluation, comparing the actual
outputs $D_{O,test}$ with the predicted outputs $D_{OP,test}$. It returns
either a combined value $F$ or instead a separate value for each
sample $d_{pred} \in D_{OP,test}$, resulting in the per-sample output $\boldsymbol{F}$.


IV. BENCHMARK IMPLEMENTATION 


We implemented the framework described above by link- 
ing together several datasets, models, splitting methods, and 
metrics. These were chosen not to comprehensively cover 
all possible gap acceptance scenarios and prediction models 
but to demonstrate the flexibility and utility of the proposed 
framework. Still, our implementation can already serve as 
a benchmark for new prediction models. This section only 
presents an overview of the implementation, with full technical 
specification provided in supplementary materials. 


A. Datasets 


Different datasets are implemented into the framework 
(Table II), including data recorded on real roads as well 
as data from a driving simulator study. The naturalistic 
datasets used here are captured by drones and distinguished 
by accurate position labeling. They cover lane changes on 
German highways (the highD dataset [51]) and roundabouts 
(the rounD dataset [52]). The L-GAP dataset covers left turns 
at unsignalized intersections through oncoming traffic recorded 
in a driving simulator [35]. It has been chosen due to the 
simplicity of its environment, contrasting the more complex 
scenarios in the naturalistic datasets. 


TABLE I
TRANSFORMATION BETWEEN ANY POSSIBLE PREDICTION TYPES $d_{pred}$ USING THE THREE IMPLEMENTED FUNCTIONS $T_i$.

Input \ Output | $a_{pred}$ | $\{a_{pred}, t_{A,pred}\}$ | $X_{T_O,pred}$
Binary prediction $a_{pred}$ | = | $\{a_{pred}, T_3(a_{pred})\}$ | $T_2(\{a_{pred}, T_3(a_{pred})\})$
Timing prediction $\{a_{pred}, t_{A,pred}\}$ | = | = | $T_2(\{a_{pred}, t_{A,pred}\})$
Trajectory prediction $X_{T_O,pred}$ | $a_{pred} \in T_1(X_{T_O,pred})$ | $T_1(X_{T_O,pred})$ | =


TABLE II
THE NUMBER OF ACCEPTED GAPS $N_A$ AND REJECTED GAPS $N_{\neg A}$ IN THE IMPLEMENTED DATASETS, IN THE FORM: $N_A$ – $N_{\neg A}$ (MEDIAN $\hat{t}_C(t_0)$). THE
NUMBERS DEPEND ON THE METHOD FOR CHOOSING THE TIME OF PREDICTION $t_0$.

Dataset | Initial gaps at their opening | Gaps with fixed size | Critical gaps
highD (Lane changes) | 1406 – 7026 (8.9 s) | 461 – 1568 (11.8 s) | 0 – 7025 (0.8 s)
highD (Lane changes - restricted) | 1406 – 1001 (12.9 s) | 392 – 241 (8.7 s) | 0 – 1000 (0.7 s)
rounD (Roundabout) | 662 – 917 (2.5 s) | 168 – 168 (2.9 s) | 33 – 913 (1.0 s)
L-GAP (Left turns) | 703 – 724 (4.6 s) | 496 – 572 (3.5 s) | 369 – 723 (2.3 s)


1) Lane changes: Here we focus on lane changes of the
target vehicle $V_T$ toward a faster lane to the left, along which
the ego vehicle $V_E$ driving there has the right of way. While
it could be argued that predictions in such situations could be
simply based on turn signals, one cannot rely on human drivers
to use these correctly [47]. As a source of lane change data,
we used two versions of the highD dataset, full and restricted.


The full highD dataset, not employing any filters, is heavily
biased toward trajectories without a lane change. This is not
a problem per se, but in such trajectories it is not known
whether the target vehicle $V_T$ even had an intention to change
lanes (i.e., if there was a gap acceptance situation in the first
place). For this reason, in addition to the full highD dataset, we
added a restricted version of it which only includes samples
for which it can be inferred that the target vehicle $V_T$ indeed
considered a lane change. The criteria are either a lane change
of $V_T$ after $V_E$ has passed or $V_T$ braking to avoid colliding with
the preceding vehicle instead of changing lanes. Still, in both
versions of highD, the gaps are always accepted with large
safety margins (Table II).


2) Roundabout: In the rounD dataset, the target vehicle
$V_T$ has to enter a roundabout, which it can do in front of or
behind the ego vehicle $V_E$ already in the roundabout. As the
trajectories are recorded in Germany, the ego vehicle inside
the roundabout has the right of way. Compared to highD, this
dataset is far more balanced between accepted and rejected
gaps, but still only includes few critically accepted gaps.


3) Left turns: In the L-GAP dataset [35], the driver of
the target vehicle $V_T$ intends to turn left at an intersection.
The driver had to decide whether to do this in front of
or behind the ego vehicle $V_E$ approaching the intersection
from the opposite direction with the right of way. While the
number of samples in this dataset is comparatively small, they
are relatively balanced between accepted and rejected gaps.
Also, they include many gaps accepted after $t_{crit}$ (Table II).
Nonetheless, as $V_T$ starts in an idling position at some distance
from the contested space, this might not be the most challenging
dataset, as an onset of movement before $t_0$ in most cases is
an apparent indicator of $V_T$ intending to accept the gap.


B. Test-train Splitting Methods 


Two splitting methods are implemented, without a method
for calculating the similarity measure $\zeta$. Nonetheless, to enable
at least a qualitative approximation of a model’s robustness, 
the methods are designed to produce testing sets of varying 
difficulty for the prediction models. 

The easier variant performs a stratified random splitting, 
while the second, more extreme method sorts the most unin- 
tuitive behavior of the target vehicle into the testing set (e.g. 
accepting a very small gap or rejecting a very large gap). 

In both cases, the testing set includes 20% of the samples 
and the training set the remaining 80%. 
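Both splitting ideas can be sketched in a few lines of Python. The example below is an illustration under stated assumptions and not the benchmark's code: the function names are hypothetical, the gap size at prediction time is used as the only measure of how unintuitive a decision is, and the 20% share is applied to each class separately.

```python
import numpy as np

def stratified_split(a, test_share=0.2, seed=0):
    """Random split that keeps the accepted/rejected ratio in both sets."""
    a = np.asarray(a)
    rng = np.random.default_rng(seed)
    test = np.zeros(a.size, dtype=bool)
    for label in (0, 1):
        idx = rng.permutation(np.flatnonzero(a == label))
        test[idx[: int(round(test_share * idx.size))]] = True
    return ~test, test

def critical_split(a, gap, test_share=0.2):
    """Put the least intuitive decisions into the testing set.

    a   : binary decisions (1 = accepted, 0 = rejected)
    gap : offered gap size at the prediction time, one value per sample
    """
    a, gap = np.asarray(a), np.asarray(gap, dtype=float)
    test = np.zeros(a.size, dtype=bool)
    acc, rej = np.flatnonzero(a == 1), np.flatnonzero(a == 0)
    n_acc = int(round(test_share * acc.size))
    n_rej = int(round(test_share * rej.size))
    # Accepted gaps that were very small ...
    test[acc[np.argsort(gap[acc])[:n_acc]]] = True
    # ... and rejected gaps that were very large.
    test[rej[np.argsort(gap[rej])[::-1][:n_rej]]] = True
    return ~test, test
```

Both functions return boolean masks for the training and testing samples, so they can be applied to input and output data alike.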


C. Models 


The benchmark includes two state-of-the-art trajectory
prediction models:


• Trajectron++ (also referred to as T+), a deep-learning
model mainly based on long short-term memory
cells [10].

• AgentFormer (AF), a deep-learning model based on
transformers [12]. Compared to T+, it has ten times more
trainable parameters.


For the binary prediction models for gap acceptance, there
is, as mentioned above, a lack of a common benchmark,
making the selection of models more contentious. Four models
have been selected nonetheless:

• Logistic regression (LR) is commonly used for predicting
human gap acceptance decisions [32] and is therefore
included as a simple baseline.

• Random forests (RF) have been shown to outperform
other approaches such as logistic regression and standard
decision trees in gap acceptance prediction [33].

• Deep belief networks (DB) have also been used previously to
predict human gap acceptance decisions [39].

• A metaheuristic model (MH) combines all five other
models above; a similar approach for
lane changes has previously been shown to outperform each of the
models included in it [41].


The benchmark does not include any dedicated timing pre- 
diction models yet, as their primary representative, the drift- 
diffusion model [31], [35], can currently not be trained on 
datasets with a large number of unique samples in a reasonable 
amount of time. Nonetheless, to allow for future expansion of 
the benchmark, the framework has been designed with such 
models in mind. 


D. Evaluation Metrics 


We have included several metrics that characterize models
in terms of the quality of their binary predictions (accept/reject
gap) as well as their full trajectory predictions. The following
metrics are commonly used in the literature:

• Accuracy: This metric is widely used to evaluate the
  performance of binary prediction models [33], [39], [41].
  However, accuracy is a symmetric metric, i.e., it cannot
  differentiate between false negative and false positive
  predictions. It is therefore best used in cases where
  $t_0 < t_{\text{crit}}$, as the consequences of the two types of
  false predictions do not differ much there (Appendix A-A).

• AUC: This metric for binary prediction models, the Area
  Under the Curve of a receiver operating characteristic curve,
  addresses one central point of criticism of the accuracy
  metric, namely its sensitivity to biases in the testing set.
  Nonetheless, like accuracy, it does not consider the
  potentially differing severity of false predictions.
  Hence, we only apply it to rate a prediction model's
  performance when $t_0 < t_{\text{crit}}$.

• ADE$_\beta$: This metric, the Average Displacement Error of
  the $n_p \beta$ least erroneous predicted trajectories, is
  commonly applied to trajectory predictions [10]-[12], with
  $\beta = 1$ or $\beta = 0.05$ being used in this work. As this
  metric also does not take into account the severity of
  different false predictions [25], [26] and additionally requires
  equally long prediction horizons, it is only applied for
  constant gap sizes
  ($t_0 = \min\{t \mid t_C(t) - t = \Delta t\} < t_{\text{crit}}$).
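
The ADE$_\beta$ computation can be sketched as follows. This is only a minimal illustration under stated assumptions, not the benchmark's implementation: the function name `ade_beta`, the use of Euclidean distance in two dimensions, and rounding $n_p \beta$ up to at least one trajectory are all choices made here for the example.

```python
import math


def ade_beta(predicted, truth, beta=1.0):
    """Average displacement error over the best beta-fraction of predictions.

    predicted: list of n_p trajectories, each a list of (x, y) points
    truth:     the observed trajectory, a list of (x, y) points
    beta:      fraction of least erroneous trajectories to keep (0 < beta <= 1)
    """
    errors = []
    for trajectory in predicted:
        # Mean Euclidean distance between predicted and observed positions.
        dists = [math.dist(p, q) for p, q in zip(trajectory, truth)]
        errors.append(sum(dists) / len(dists))

    errors.sort()
    # Assumption: keep ceil(n_p * beta) trajectories, but at least one.
    n_keep = max(1, math.ceil(beta * len(errors)))
    return sum(errors[:n_keep]) / n_keep


truth = [(0.0, 0.0), (1.0, 0.0)]
predicted = [
    [(0.0, 0.0), (1.0, 0.0)],  # perfect prediction, error 0
    [(0.0, 1.0), (1.0, 1.0)],  # offset by 1, error 1
    [(0.0, 3.0), (1.0, 3.0)],  # offset by 3, error 3
]
ade_all = ade_beta(predicted, truth, beta=1.0)    # averages all three errors
ade_best = ade_beta(predicted, truth, beta=0.05)  # keeps only the best one
```

With $\beta = 1$ every predicted trajectory contributes, whereas a small $\beta$ rewards a model as soon as a few of its samples are close to the observed trajectory.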


In addition to the above metrics, we also propose a novel
metric that takes into account the potential consequences of a
wrong prediction.


• TNPR: The True Negative rate under Perfect Recall is
  applied to binary predictions, where the threshold for
  categorizing a prediction $a_{\text{pred}}$ as an accepted gap is
  set so low that there are no false negative predictions
  (perfect recall). It considers the different consequences of
  false negative and false positive predictions, assuming that
  collisions are to be avoided at all costs. It consequently
  estimates the likelihood of not having to brake needlessly for
  rejected gaps while guaranteeing safe interactions. This metric
  is designed specifically for critical gap sizes with
  $t_0 \approx t_{\text{crit}}$, as at earlier time points the need
  for perfect recall would be unreasonable.
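
A minimal sketch of how TNPR could be computed is given below. It assumes that predictions are acceptance probabilities and that the threshold is placed at the lowest probability assigned to any truly accepted gap, which is the highest threshold that still yields no false negatives. The function name `tnpr` and the handling of ties are assumptions made for this example.

```python
def tnpr(a_pred, accepted):
    """True negative rate under perfect recall.

    a_pred:   predicted acceptance probabilities, one per gap
    accepted: ground truth, True if the gap was accepted
    """
    positives = [p for p, a in zip(a_pred, accepted) if a]
    negatives = [p for p, a in zip(a_pred, accepted) if not a]
    if not positives or not negatives:
        raise ValueError("TNPR needs both accepted and rejected gaps")

    # Highest threshold at which every accepted gap is still predicted
    # as accepted (prediction >= threshold), i.e. perfect recall.
    threshold = min(positives)

    # Rejected gaps scored strictly below the threshold are true negatives;
    # a tie with the threshold counts as a false positive.
    true_negatives = sum(1 for p in negatives if p < threshold)
    return true_negatives / len(negatives)


scores = [0.9, 0.4, 0.3, 0.5, 0.1]
labels = [True, True, False, False, False]
result = tnpr(scores, labels)  # threshold is 0.4; two of three rejections pass
```

A single accepted gap with a very low predicted probability pulls the threshold down for the whole test set, which is why this metric is sensitive to exactly the unintuitive, safety-critical samples.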


V. RESULTS 


Our benchmark provides insights into the performance of the
tested models under different conditions (Figures 4, 5). When
provided with more input time steps ($n_I = 10$ vs. $n_I = 2$),
i.e., with more information from which to extract signs of future
behavior, the models' predictions were generally better. This is
as expected, as is the worse performance of models tested on the
most unintuitive samples under extreme splitting. There, the
models have to extrapolate, typically a far more difficult task
than the interpolation needed for predicting random samples
similar to the training set [53]. The poor performance on
unintuitive samples is especially pronounced when looking at the
TNPR at critical gaps (Figure 5), where no model could
substantially outperform a random predictor on both datasets.

Furthermore, it can be seen in Figure 4 that accuracy, as it
depends on the distribution of accepted and rejected gaps in
the test set, is not a suitable metric for comparing models'
performance across different datasets and prediction times.
The same can be said of the ADE. AUC instead seems a far more
reliable metric, being the only one that accurately captures
that most models have fewer problems with the restricted lane
change scenario than with the full one.
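
The dependence of accuracy on the class balance of the test set can be illustrated with a toy example. The sketch below uses made-up labels and an uninformative predictor that always outputs the same low acceptance probability; the helper names `accuracy` and `auc` and the decision threshold of 0.5 are assumptions for this illustration only. AUC is computed as the fraction of accepted/rejected pairs that are ranked correctly, with ties counted as one half.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    hits = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return hits / len(labels)


def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking of positives vs. negatives."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test sets: 1 = accepted gap, 0 = rejected gap.
skewed_labels = [1] * 1 + [0] * 9
balanced_labels = [1] * 5 + [0] * 5
constant_scores = [0.2] * 10  # always predicts "rejected"

acc_skewed = accuracy(constant_scores, skewed_labels)
acc_balanced = accuracy(constant_scores, balanced_labels)
auc_skewed = auc(constant_scores, skewed_labels)
auc_balanced = auc(constant_scores, balanced_labels)
```

The same uninformative predictor reaches an accuracy of 0.9 on the skewed set but only 0.5 on the balanced one, while its AUC is 0.5 in both cases.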

When comparing the difficulty of different scenarios, it can
be seen that, generally, predicting human behavior at
roundabouts seems to be easier than predicting lane change
behavior. For those two scenarios, it can also be observed that
predictions made for constant gap sizes seem slightly more
accurate than those made for initial gaps. The exception is the
left turn scenario, where this difference is far more
pronounced, especially for $n_I = 2$. We can explain those
difficulties by the initial lack of motion of the target vehicle
$V_T$. The difference is smaller for $n_I = 10$, as the prediction
in this scenario is made at $t_0 = t_S + (n_I - 1)\,\delta t$
(instead of $t_0 = t_S$) due to the lack of trajectory data before
$t_S$. Consequently, for $n_I = 10$, an onset of motion is far more
likely to be recorded than for $n_I = 2$. Overall, predictions
on the left turn scenario are relatively easy when made for
constant gap sizes, while they are relatively hard when made at
the opening of the gap.

Besides the general trends mentioned above, which are
mostly valid for all models, we can also compare the models
with one another. Among the trajectory prediction models, the
Trajectron++ model (T++) consistently outperforms the
AgentFormer model (AF). This is surprising, as AF previously
outperformed T++ on pedestrian trajectory prediction
benchmarks [12]. This contradiction might be explained by the
many trainable parameters of AF being over-fitted on the
relatively small datasets used here. Meanwhile, the logistic
regression (LR) model is often the most promising approach
among the binary prediction models, especially when tested
on random samples. Together, those results indicate that
increasing the complexity of such models and their number of
parameters might not be a panacea, with simpler models being
more promising, especially if datasets are relatively small.

Fig. 4. The results of evaluating behavior prediction models in gap acceptance
scenarios on different datasets, with predictions being made either at the initial
opening of the gap ($t_0 = t_S$) or with fixed gap sizes
($t_0 = \min\{t \mid t_C(t) - t = \Delta t\}$). The color indicates the number of
input time steps $n_I$ given to the models, and the marker type denotes the
splitting method, which can be random or extreme. The dashed gray lines
indicate the performance of a uniformly random binary predictor.

Fig. 5. The true negative rate under perfect recall (TNPR) of different
prediction models, using the same visualizations as in Figure 4, tested on
the last useful predictions ($t_0 = t_{\text{crit}} - t_\epsilon$).

When we compare binary models against trajectory prediction
models, we can observe differing behavior for different
scenarios. On the one hand, the best performance on the lane
change datasets is generally achieved by a binary prediction
model, while on the other hand, the reverse is the case on the
other two datasets. One main difference here is the prediction
horizon ($\Delta t \approx 10$ s and $\Delta t < 4$ s, respectively),
which might explain those results. That the average displacement
errors are much more noticeable in the lane change scenario also
supports this explanation, further showing the difficulties of
using trajectory prediction under those conditions.

Lastly, when evaluating the promise of conditional trajectory
prediction models, we find that the benefits are negligible
in most cases, except for roundabouts and left turns under
ADE$_{0.05}$. Nonetheless, due to the problems with that metric,
more than those results are needed to render a final judgment.
Similarly, due to the small size of the datasets and the low
number of models, the previous results should also be treated
with care.


VI. CONCLUSION 


We proposed a framework that connects previously disparate
datasets, models, and metrics in a benchmark for testing
behavior prediction models in gap acceptance scenarios. We
demonstrated its potential by comparing two state-of-the-art
trajectory prediction models with several binary gap acceptance
models. Additionally, we showed that relying on the
characteristic time points of gap acceptance scenarios to select
the most unintuitive samples in the splitting module is a
promising approach to analyzing model generalization, as seen
in the general decrease in performance for models tested on
those samples. Our framework is open-source and specifically
designed to simplify adding new datasets, splitting methods,
models, and evaluation metrics, which allows researchers to
expand it in the future.

One particularly important addition to the benchmark would
be datasets containing more critically accepted gaps, as
this would make metrics applied to the last useful predictions
more meaningful. Additionally, a metric better aligned with the
main purpose of a prediction model as a part of an autonomous
vehicle is still needed. Likewise, there currently exists no
method for calculating the similarity between testing and
training sets; the future addition of such a method would
permit a quantitative comparison of a model's robustness
against unintuitive test samples. Lastly, one could expand the 
framework to provide scenario-independent inputs similar to 
to, which would make training a model on two unrelated 
datasets simultaneously possible, leading to a better estimation 
of the models' generalizability.

We acknowledge that testing a model in gap acceptance
scenarios alone is necessary, but not sufficient, for justifying
its usage in actual vehicles. Consequently, expanding the
framework to non-gap-acceptance scenarios is an important
avenue for future research. This will enable more holistic
testing, but only for models predicting (and metrics evaluating)
trajectories. Nonetheless, the performance of models in
non-gap-acceptance scenarios should still be given lower priority
than that in gap acceptance scenarios, which are more
safety-critical.

Our results resonate with the recent literature on hybrid 
AI [54], [55], showing that including binary prediction models 
in specific scenarios might make data-driven trajectory predic- 
tion models more reliable, especially in accurately predicting 
dangerous situations. However, especially for the unintuitive 
and safety-critical edge cases, most models often performed 
only slightly better than a random predictor at best. Therefore, 
there currently seems to be no model that a trajectory planning 
algorithm in any scenario can rely on to substantially increase 
the effectiveness of an autonomous vehicle’s driving style, 
necessitating further research into such models. 


APPENDIX A 
DETAILING THE FRAMEWORK FOR BENCHMARKING GAP 
ACCEPTANCE MODELS 


A. Evaluating the predictions - Metric E 


Due to the differing consequences of false negative and false
positive predictions when using binary predictions
$a_{\text{pred}}$, there are limitations on which metrics are
usable at certain $t_0$. For $t_0 < t_{\text{crit}}$, the
consequences of a wrong prediction are generally minor, as time
is left to wait for future information before more significant
changes to trajectory planning are necessary. Furthermore, even
if the target vehicle were to accept the gap immediately after
$t_0$, the necessary response is likely neither uncomfortable nor
risky. Consequently, symmetric metrics can be used here.

For $t_0 \approx t_{\text{crit}}$, however, no more time for further
observations is left, resulting in far more severe consequences
for both false positive and false negative predictions. A false
positive prediction would unnecessarily result in a harsh and
uncomfortable braking maneuver. Meanwhile, a false negative
prediction leads to an unsafe gap acceptance maneuver, with
the safety of the interaction between $V_E$ and $V_T$ no longer
in the control of the autonomous vehicle $V_E$. Accidents or the
need for dangerous emergency maneuvers, which could result
in material damage or even bodily harm, are then possible.
As the latter should be avoided at all costs, a false negative
prediction at this time is far more consequential, which should
be reflected in the evaluation metric.


B. Setting the prediction time - Metric E 


The method for determining the time $t_0$ can impact the
size of the resulting dataset (Table II) due to the condition
from Equation (2). Namely, for constant gap sizes
($t_0 = \min\{t \mid t_C(t) - t = \Delta t\}$), the number of
available samples will be reduced, as all gaps with a smaller
initial gap size ($t_C(t_S) - t_S < \Delta t$) will be excluded.
The same is the case for gaps already accepted before $t_0$. For
critical gap sizes instead (i.e., $t_0 = t_{\text{crit}} - t_\epsilon$),
all gaps accepted before $t_{\text{crit}}$ will be excluded, leading
to extremely biased datasets, sometimes even removing all
accepted gaps.


C. Extracting input and output - Dataset D 


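Trajectories $X_T$ are only provided over the interval
$I_T = [\min T, \max T]$. The conditions $C_S$, $C_C$, and $C_A$ are
then used to extract $t_S$, $t_C$, and $t_A$, respectively, with
$T_{C_i} = \{t \in I_T \mid C_i(t)\}$:

$$
\begin{aligned}
t_S &= T_S(X_T) = \begin{cases} \min T & T_{C_S} = \emptyset \\ \max T_{C_S} & \text{else} \end{cases} \\
t_C &= T_C(X_T) = \begin{cases} t_C(\max T) & T_{C_C} = \emptyset \\ \min T_{C_C} & \text{else} \end{cases} \\
t_A &= T_A(X_T) = \begin{cases} \max T + t_\epsilon & T_{C_A} = \emptyset \\ \min T_{C_A} & \text{else} \end{cases}
\end{aligned}
\tag{A.3}
$$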


If $T_{C_C} = \emptyset \land T_{C_A} = \emptyset$, the sample will be
excluded from the dataset, as no decision can be observed. If not,
the number of output time steps $n_O$ is calculated by the
following equation:


$$
n_O = \frac{t_C - t_0}{\delta t} \tag{A.4}
$$

This is chosen so that $T_A(X_T) \le \max T_O$ is sufficient to
categorize the gap as accepted.


D. Making the predictions for the testing set - Model M 


The predicted times $t_{A,\text{pred}}$ are the decile values of the
underlying probability distribution represented by the set $T_A$:

$$
t_{A,\text{pred}} = \{Q_{t_A}(p) \mid p \in \{0.1, 0.2, \ldots, 0.9\}\} = Q_9(T_A) \in \mathbb{R}^9 \tag{A.5}
$$

Here, $Q_{t_A}$ is the quantile function associated with this
underlying distribution of $t_A$. Meanwhile, the predicted
stochastic trajectory $X_{T_O,\text{pred}}$ is expressed using $n_p$
deterministic trajectories $X_{T_O,p}$:

$$
X_{T_O,\text{pred}} = \{X_{T_O,1}, \ldots, X_{T_O,n_p}\} \tag{A.6}
$$
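
As an illustration of the decile representation, the nine decile values of a set of sampled acceptance times can be obtained with Python's standard library. This is only a sketch: the sample values are made up, and the choice of the `inclusive` interpolation method is an assumption, as the text does not specify how the quantile function is estimated from samples.

```python
import statistics

# Hypothetical samples of the acceptance time t_A, in seconds.
t_a_samples = [float(t) for t in range(11)]  # 0.0, 1.0, ..., 10.0

# statistics.quantiles with n=10 returns the 9 cut points that
# split the data into ten equally probable intervals (the deciles).
t_a_pred = statistics.quantiles(t_a_samples, n=10, method="inclusive")
```

The resulting list has nine entries, matching the nine probability levels 0.1 to 0.9 used for the predicted times.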


E. Transforming the predictions - Dataset D 


When implementing the transformation of a prediction
$a_{\text{pred}}$ into another form, two instances of the trajectory
prediction model Trajectron++ [10] are used, namely $M_A$, trained
on all samples from the specific dataset $D$ ($D_I$ and $D_O$) where
$a = 1$, and $M_{\neg A}$, trained on the remaining samples with
$a = 0$. Furthermore, the function $f_A$ is defined:

$$
f_A(X_{T_O,p}) = \begin{cases} 1 & T_A(X_{T_O,p}) \le \max T_O \\ 0 & \text{else} \end{cases} \tag{A.7}
$$
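$T_1$: The use of $f_A$ results in

$$
\begin{aligned}
a_{\text{pred}} &= \frac{1}{n_p} \sum_{p} f_A(X_{T_O,p}) \\
t_{A,\text{pred}} &= Q_9\left(\{T_A(X_{T_O,p}) \mid f_A(X_{T_O,p}) = 1\}\right).
\end{aligned}
\tag{A.8}
$$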
$T_2$: $M_A$ and $M_{\neg A}$ are used to respectively predict
$X_{T_O,\text{pred},A}$ and $X_{T_O,\text{pred},\neg A}$, resulting in

$$
X_{T_O,\text{pred}} = \{X_{T_O,p,A} \mid p \in R_A\} \cup \{X_{T_O,p,\neg A} \mid p \in R_{\neg A}\}, \tag{A.9}
$$

where

$$
\begin{aligned}
R_{\neg A} &= R\left(n_p (1 - a_{\text{pred}}), \{p \mid f_A(X_{T_O,p,\neg A}) = 0\}, 1\right) \\
R_A &= R\left(n_p a_{\text{pred}}, \{p \mid f_A(X_{T_O,p,A}) = 1\}, W_A\right).
\end{aligned}
\tag{A.10}
$$

Here, $R(m, M, W)$ is a function that randomly selects
$m$ samples from $M$ under weight $W$. $W_A$ is chosen so
that the sum of the weights of all $T_A(X_{T_O,p,A})$ in each
quantile of $t_{A,\text{pred}}$ is identical. This approach of using
conditional trajectory prediction models is inspired by
Xie et al. [39].
$T_3$: Here, one uses $M_A$ to get $X_{T_O,\text{pred},A}$, resulting in

$$
\{a_{\text{pred}}, t_{A,\text{pred}}\} = T_1(X_{T_O,\text{pred},A}). \tag{A.11}
$$

In all cases, $p \in \{1, \ldots, n_p\}$ can be assumed.
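
The first transformation, from a set of predicted trajectories to a binary prediction and predicted acceptance times, can be sketched as follows. The sketch assumes that the acceptance time of each predicted trajectory has already been extracted, so it operates on a list of times rather than on trajectories. The function name `transform_t1`, the `inclusive` quantile method, and returning `None` when fewer than two trajectories show an acceptance are assumptions made for this example.

```python
import statistics


def transform_t1(accept_times, t_max):
    """Turn per-trajectory acceptance times into (a_pred, t_A_pred).

    accept_times: acceptance time extracted from each of the n_p
                  predicted trajectories
    t_max:        end of the prediction horizon (max T_O)
    """
    # A trajectory counts as an accepted gap if its acceptance time
    # falls within the prediction horizon.
    accepted = [t for t in accept_times if t <= t_max]

    # Predicted acceptance probability: share of accepting trajectories.
    a_pred = len(accepted) / len(accept_times)

    # Deciles of the acceptance times, using accepting trajectories only.
    # Assumption: undefined (None) if fewer than two such trajectories exist.
    if len(accepted) >= 2:
        t_a_pred = statistics.quantiles(accepted, n=10, method="inclusive")
    else:
        t_a_pred = None
    return a_pred, t_a_pred


# Hypothetical example: five predicted trajectories, horizon ends at 4 s.
a_pred, t_a_pred = transform_t1([1.0, 2.0, 3.0, 9.0, 9.0], t_max=4.0)
```

In this example, three of the five trajectories accept the gap within the horizon, so the predicted acceptance probability is 0.6 and the deciles are computed from the three accepting trajectories only.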